Goto

Collaborating Authors

 Ceará


Forecasting Oncology Demand Trends with Boosting-Based Bayesian Conjugate Models

arXiv.org Machine Learning

Accurate trend forecasting in healthcare time series is essential for planning and resource allocation. This paper proposes a Bayesian framework for predicting oncology demand trends, modeling weekly appointments as a Poisson process with a Gamma prior to the demand rate. To enhance adaptability and capture persistent directional patterns, we incorporate a residual-based boosting mechanism grounded in a Gamma-Log-Normal conjugate structure. This boosting approach allows the model to track both short- and long-term trend shifts while maintaining the analytical tractability of conjugate Bayesian updating. The methodology was evaluated on real oncology service data from Cariri, Ceara, Brazil, and compared against established baselines, including linear regression, ARIMA, naive forecasting, LSTM neural networks, and XGBoost. Results showed that the proposed model outperforms competing methods in trend detection accuracy, with gains in terms of percentage of correct direction of 38.25% in relation to the second best approach in some cases.


Gromov-Wasserstein Methods for Multi-View Relational Embedding and Clustering

arXiv.org Machine Learning

Learning low-dimensional representations from multi-view relational data is challenging when underlying geometries differ across views. We propose Bary-GWMDS, a Gromov-Wasserstein-based method that operates directly on distance matrices to learn a consensus embedding preserving shared relational structure. By leveraging intrinsic distances, the approach naturally handles nonlinear distortions across views. We also introduce Mean-GWMDS-C, a clustering-oriented formulation that averages distance matrices and learns reduced-support representations via a consensus Gromov-Wasserstein transport. Experiments on synthetic and real-world datasets show that the proposed framework yields stable and geometrically meaningful embeddings.


Structure-Preserving Multi-View Embedding Using Gromov-Wasserstein Optimal Transport

arXiv.org Machine Learning

Multi-view data analysis seeks to integrate multiple representations of the same samples in order to recover a coherent low-dimensional structure. Classical approaches often rely on feature concatenation or explicit alignment assumptions, which become restrictive under heterogeneous geometries or nonlinear distortions. In this work, we propose two geometry-aware multi-view embedding strategies grounded in Gromov-Wasserstein (GW) optimal transport. The first, termed Mean-GWMDS, aggregates view-specific relational information by averaging distance matrices and applying GW-based multidimensional scaling to obtain a representative embedding. The second strategy, referred to as Multi-GWMDS, adopts a selection-based paradigm in which multiple geometry-consistent candidate embeddings are generated via GW-based alignment and a representative embedding is selected. Experiments on synthetic manifolds and real-world datasets show that the proposed methods effectively preserve intrinsic relational structure across views. These results highlight GW-based approaches as a flexible and principled framework for multi-view representation learning.


Evaluating LLMs and Prompting Strategies for Automated Hardware Diagnosis from Textual User-Reports

arXiv.org Artificial Intelligence

Computer manufacturers offer platforms for users to describe device faults using textual reports such as "My screen is flickering". Identifying the faulty component from the report is essential for automating tests and improving user experience. However, such reports are often ambiguous and lack detail, making this task challenging. Large Language Models (LLMs) have shown promise in addressing such issues. This study evaluates 27 open-source models (1B-72B parameters) and 2 proprietary LLMs using four prompting strategies: Zero-Shot, Few-Shot, Chain-of-Thought (CoT), and CoT+Few-Shot (CoT+FS). W e conducted 98,948 inferences, processing over 51 million input tokens and generating 13 million output tokens. W e achieve f1-score up to 0.76. Results show that three models offer the best balance between size and performance: mistral-small-24b-instruct and two smaller models, llama-3.2-1b-instruct


Classification of User Reports for Detection of Faulty Computer Components using NLP Models: A Case Study

arXiv.org Artificial Intelligence

Computer manufacturers typically offer platforms for users to report faults. However, there remains a significant gap in these platforms' ability to effectively utilize textual reports, which impedes users from describing their issues in their own words. In this context, Natural Language Processing (NLP) offers a promising solution, by enabling the analysis of user-generated text. This paper presents an innovative approach that employs NLP models to classify user reports for detecting faulty computer components, such as CPU, memory, motherboard, video card, and more. In this work, we build a dataset of 341 user reports obtained from many sources. Additionally, through extensive experimental evaluation, our approach achieved an accuracy of 79% with our dataset.


DataMan: Data Manager for Pre-training Large Language Models

arXiv.org Artificial Intelligence

The performance emergence of large language models (LLMs) driven by data scaling laws makes the selection of pre-training data increasingly important. However, existing methods rely on limited heuristics and human intuition, lacking comprehensive and clear guidelines. To address this, we are inspired by ``reverse thinking'' -- prompting LLMs to self-identify which criteria benefit its performance. As its pre-training capabilities are related to perplexity (PPL), we derive 14 quality criteria from the causes of text perplexity anomalies and introduce 15 common application domains to support domain mixing. In this paper, we train a Data Manager (DataMan) to learn quality ratings and domain recognition from pointwise rating, and use it to annotate a 447B token pre-training corpus with 14 quality ratings and domain type. Our experiments validate our approach, using DataMan to select 30B tokens to train a 1.3B-parameter language model, demonstrating significant improvements in in-context learning (ICL), perplexity, and instruction-following ability over the state-of-the-art baseline. The best-performing model, based on the Overall Score l=5 surpasses a model trained with 50% more data using uniform sampling. We continue pre-training with high-rated, domain-specific data annotated by DataMan to enhance domain-specific ICL performance and thus verify DataMan's domain mixing ability. Our findings emphasize the importance of quality ranking, the complementary nature of quality criteria, and their low correlation with perplexity, analyzing misalignment between PPL and ICL performance. We also thoroughly analyzed our pre-training dataset, examining its composition, the distribution of quality ratings, and the original document sources.


Generative Active Adaptation for Drifting and Imbalanced Network Intrusion Detection

arXiv.org Artificial Intelligence

Machine learning has shown promise in network intrusion detection systems, yet its performance often degrades due to concept drift and imbalanced data. These challenges are compounded by the labor-intensive process of labeling network traffic, especially when dealing with evolving and rare attack types, which makes selecting the right data for adaptation difficult. To address these issues, we propose a generative active adaptation framework that minimizes labeling effort while enhancing model robustness. Our approach employs density-aware active sampling to identify the most informative samples for annotation and leverages deep generative models to synthesize diverse samples, thereby augmenting the training set and mitigating the effects of concept drift. We evaluate our end-to-end framework on both simulated IDS data and a real-world ISP dataset, demonstrating significant improvements in intrusion detection performance. Our method boosts the overall F1-score from 0.60 (without adaptation) to 0.86. Rare attacks such as Infiltration, Web Attack, and FTP-BruteForce, which originally achieve F1 scores of 0.001, 0.04, and 0.00, improve to 0.30, 0.50, and 0.71, respectively, with generative active adaptation in the CIC-IDS 2018 dataset. Our framework effectively enhances rare attack detection while reducing labeling costs, making it a scalable and adaptive solution for real-world intrusion detection.


Data denoising with self consistency, variance maximization, and the Kantorovich dominance

arXiv.org Artificial Intelligence

We introduce a new framework for data denoising, partially inspired by martingale optimal transport. For a given noisy distribution (the data), our approach involves finding the closest distribution to it among all distributions which 1) have a particular prescribed structure (expressed by requiring they lie in a particular domain), and 2) are self-consistent with the data. We show that this amounts to maximizing the variance among measures in the domain which are dominated in convex order by the data. For particular choices of the domain, this problem and a relaxed version of it, in which the self-consistency condition is removed, are intimately related to various classical approaches to denoising. We prove that our general problem has certain desirable features: solutions exist under mild assumptions, have certain robustness properties, and, for very simple domains, coincide with solutions to the relaxed problem. We also introduce a novel relationship between distributions, termed Kantorovich dominance, which retains certain aspects of the convex order while being a weaker, more robust, and easier-to-verify condition. Building on this, we propose and analyze a new denoising problem by substituting the convex order in the previously described framework with Kantorovich dominance. We demonstrate that this revised problem shares some characteristics with the full convex order problem but offers enhanced stability, greater computational efficiency, and, in specific domains, more meaningful solutions. Finally, we present simple numerical examples illustrating solutions for both the full convex order problem and the Kantorovich dominance problem.


Text-to-SQL based on Large Language Models and Database Keyword Search

arXiv.org Artificial Intelligence

Text-to-SQL prompt strategies based on Large Language Models (LLMs) achieve remarkable performance on well-known benchmarks. However, when applied to real-world databases, their performance is significantly less than for these benchmarks, especially for Natural Language (NL) questions requiring complex filters and joins to be processed. This paper then proposes a strategy to compile NL questions into SQL queries that incorporates a dynamic few-shot examples strategy and leverages the services provided by a database keyword search (KwS) platform. The paper details how the precision and recall of the schema-linking process are improved with the help of the examples provided and the keyword-matching service that the KwS platform offers. Then, it shows how the KwS platform can be used to synthesize a view that captures the joins required to process an input NL question and thereby simplify the SQL query compilation step. The paper includes experiments with a real-world relational database to assess the performance of the proposed strategy. The experiments suggest that the strategy achieves an accuracy on the real-world relational database that surpasses state-of-the-art approaches. The paper concludes by discussing the results obtained.


LegalScore: Development of a Benchmark for Evaluating AI Models in Legal Career Exams in Brazil

arXiv.org Artificial Intelligence

This research introduces LegalScore, a specialized index for assessing how generative artificial intelligence models perform in a selected range of career exams that require a legal background in Brazil. The index evaluates fourteen different types of artificial intelligence models' performance, from proprietary to open-source models, in answering objective questions applied to these exams. The research uncovers the response of the models when applying English-trained large language models to Brazilian legal contexts, leading us to reflect on the importance and the need for Brazil-specific training data in generative artificial intelligence models. Performance analysis shows that while proprietary and most known models achieved better results overall, local and smaller models indicated promising performances due to their Brazilian context alignment in training. By establishing an evaluation framework with metrics including accuracy, confidence intervals, and normalized scoring, LegalScore enables systematic assessment of artificial intelligence performance in legal examinations in Brazil. While the study demonstrates artificial intelligence's potential value for exam preparation and question development, it concludes that significant improvements are needed before AI can match human performance in advanced legal assessments. The benchmark creates a foundation for continued research, highlighting the importance of local adaptation in artificial intelligence development.